Their utility in large-scale educational program evaluation is
discussed. The examination of these methodological developments
indicates how people (evaluators, methodologists, agency staff)
involved in the design and conduct of large-scale program evaluation
might approach decisions concerning appropriate methodology and its
proper use. Evaluation activities and the range of methodological
issues considered are: field-based investigations of large-scale
programs; evaluation of on-going programs and of various forms of
social experiments; and both well-defined and broad-based educational
programs. Both analytical procedures employ explicit models of the
phenomena believed to be responsible for the difficulties in
estimating program effects. Both are also adaptable to situations
where there are no specific comparison or control groups and where
panel data exist on program participants. A discussion of the LISREL
model and its limitations as an analytical approach to estimation in
structural equation modeling with latent variables is included.

Introduction

The following report selectively examines recent developments in
quantitative methodology and considers their possible utility in large-scale
program evaluations in education. At the outset we limit attention to two
specific categories of analytical methods: structural equation modeling, and
selection modeling and related issues in the analysis of quasi-experimental
data (non-equivalent control group designs). While these topics by no means
cover the full range of recent advances in the technology for analyzing
quantitative data in large-scale program evaluations, they are representative
of the methodological concerns that arise in such investigations, the means
analysts propose to deal with the concerns, and the strengths and limitations
of primarily technical approaches to resolving ambiguity in evaluation
results. As such, our examination of these methodological developments is
intended to suggest how persons (evaluators, methodologists, agency staff)
involved in the design and conduct of large-scale program evaluation might
approach decisions about appropriate methodology and its proper use.

Delineation of Relevant Program Evaluations 

We further delineate the purview of this investigation by stating the
types of evaluation activities and the range of methodological issues to be
considered. We are concerned with field-based investigations of large-scale
programs, typically approved by legislative actions and implemented
(or to be implemented) by governmental agencies. Both evaluations of
on-going programs (e.g., Title I) and of various forms of social experiments
(e.g., Negative Income Tax experiments) are relevant to the present
discussion (Cook (1981) restricts his attention to the former). The domain
also encompasses both well-defined programs (i.e., those with a discrete
number of specific program alternatives such as the various models in
operation in Planned Variation Follow Through) and broad-based educational
reforms as represented by Title I, the Emergency School Aid Act (ESAA), and
bilingual education. (A related paper (Burstein, 1981) focused strictly on
evaluations of well-defined programs.)

Types of Evaluation Questions 

The limits placed on the evaluation activities of interest are in the
kinds of questions one seeks to answer and the form of data collection in
the evaluation. Cook (1981) discusses six types of questions that evaluators
try to answer:

1) Who are the clientele and service providers and to what extent are
target groups among the clients? (Demography)

2) What are the delivered services and the contexts in which services
are received? (Implementation)

3) How do program services affect clients in both expected and
unexpected ways? (Effectiveness)

4) How are other elements (teachers, schools, families, etc.) of the
educational system affected by the program services? (Impact)

5) Why do program services affect outcomes in the way they do? (Causation)

6) What are the costs of the services and how cost-effective are
different ways of achieving a particular result? (Economic costs)

The questions about effectiveness, impact, and causation are central
to our examination. To be comprehensive, investigations of these types of
questions require information about the characteristics of the program,
its clients and participants and the context in which it is implemented,
the educational and social processes (intended and actual) occurring within
program sites, and the outcomes of programs at various levels (student,
teacher, classroom, school, community, etc.) of the educational system.
Conceptual and analytical machinery are then employed to elucidate the
linkages and connections among the various sources of information.

Types of Data Collections 

In the past, most large-scale field evaluations of educational programs
collected mainly "quantitative" measures of program characteristics and
outcomes largely derived from survey questionnaires completed by clients and
other relevant program participants (e.g., teachers, principals, parents),
limited interviews with program personnel and observations of program
activities (e.g., Stallings and Kaskowitz, 1974), and paper-and-pencil
measures of cognitive and affective outcomes. Data were collected from
multiple sites for each variant of the program to achieve a given degree of
information about program variation and a sufficient number of observations
for statistically powerful tests of program effects.

Recently, however, data collection in even large-scale program evaluations
has taken on an increasingly "qualitative" character. Extended case studies
were conducted in either a subset or all sites in a number of recent
large-scale evaluations (e.g., the Title I Parent Involvement Study conducted
by SDC; the Study of the Longitudinal Effects of the California Early
Childhood Education Program conducted by CSE; the Rand Study of Federal
Programs Supporting Educational Change; the evaluation of Curriculum
Development Projects in Science Education conducted by CIRCE). At the least,
the inclusion of case studies in these evaluations provides a richer picture
of program process than was obtainable from strictly questionnaire
information. And, as methods for synthesizing multiple case studies and
integrating qualitative and quantitative information improve, qualitative
methods will play an increasingly prominent role in the repertoire of
evaluation activities previously concentrated on less dense forms of data
collection.

Despite the increasing role of qualitative methods and our positive
attitude about their central role in future evaluations, the remainder of
the paper will restrict attention to developments in quantitative methods
from multi-site investigations using questionnaire, interview, test and
perhaps small-scale observational data. We impose this restriction for two
reasons. First, the analytical developments considered are appropriate
primarily for the more traditional kinds of quantitatively oriented studies.
Second, others (e.g., Daillak & Alkin, 1981) are more capable at this point
of stating the case for qualitative methods.

Overview of the Report 

The remainder of the report will proceed as follows. First, a general
overview of current perspectives on the design and conduct of large-scale
program evaluations is presented. The intent is to explain why the climate
for future large-scale evaluations is conducive to the introduction of
improved methods of analysis. Second, two specific categories of analytical
methods (structural equation modeling, and selection modeling/analysis of
non-equivalent control group designs) are considered. The basic conceptual
and analytical foundations for each method are described, issues
that motivate its use in program evaluations are delineated, and specific
strengths and weaknesses of each method in program evaluation contexts
are identified.

Current Perspectives on Design and Analysis
in Large-scale Program Evaluations

There are strong signs that large-scale educational evaluation has
witnessed the end of an era. From the late 1960's and throughout the 1970's,
the federal government, under legislative mandate, mounted major evaluations
of just about every conceivable educational program. Wargo (1977)
points to 110 major evaluations of federal educational programs funded by
the Office of Planning, Budgeting, and Evaluation of the Office of Education
at a cost of over $80 million during the 1971-1979 period. The figure does
not even include all the major evaluations done by the Office of Education,
much less NIE and other branches of HEW.

Many of these large-scale multiyear studies have been highly visible
in the educational community, though their direct influence on legislative
action is less clear (Barnes & Ginsberg, 1979; Cohen & Garet, 1975; Cross,
1979; Wisler & Anderson, 1979). In most cases, the debates about the quality
and merits of these evaluations have been heated. This has especially been
the case for evaluations of compensatory programs such as Head Start (e.g.,
Cicirelli et al., 1969, 1971; Smith & Bissell, 1971), Project Follow Through
(Anderson, 1976; Cline et al., 1974; Haney, 1977a, 1977b; House, Glass,
McLean, & Walker, 1978; Stebbins et al., 1977), and Bilingual Education (AIR,
1979; Center for Applied Linguistics, 1979). The literature on evaluations of
these programs is replete with critiques, reanalyses, and secondary analyses,
not to mention the often self-serving attacks from program advocates and
critics.

Signs of Change

Emphasis. There are clear signs, however, that the large-scale evaluations
of the 1980's may well be different. First, recent scholarly (e.g.,
Cook, 1981; Cronbach, 1978; Cronbach & Associates, 1980; House, 1977, 197^;
Raizen & Rossi, 1981) and policy (e.g., Boruch & Cordray, 1980)
contributions provide well-reasoned accounts of the complexity of program
evaluations in highly politicized contexts and persuasive arguments for
different views of evaluation's role in the formation of social policy. These
writings urge that less emphasis be placed on the traditional social
science/experimental design paradigm for impact evaluation while more effort
be devoted to describing and explaining the processes of educational programs
and their consequences over a broad range of outcomes. The overly simplistic
overall program impact question (i.e., does program A affect pupil outcomes?)
that guided so many of the OPBE-funded studies (e.g., ESAA (Coulson et al.,
1977); Follow Through (Stebbins et al., 1977); and Bilingual Education (AIR,
1979)) appears to be on the decline.

Instead, recent evaluations involve more direct efforts to investigate
and describe the consequences (intended and otherwise) of educational
programs. This "information" as characterized by Cronbach et al. (1980)
involves a "move away from stand-alone evaluations of programs and toward
a more synoptic view of the numerous programs that address the same social
programs" (pp. 72-73), and they urge that evaluations employ multiple studies
using different strategies to investigate subquestions and that the
evaluation plan evolve as individual studies expose uncertainties more
clearly. The NIE Compensatory Education Study (NIE, 1977) and the evaluations
of services to handicapped children under Public Law 94-142 (Bureau of
Education for the Handicapped, 1978) are clear examples of this type
of evaluation.

This shift in evaluation emphasis is a logical response to the
findings that variation in implementation within a program is generally
greater than between programs (Stebbins et al., 1977), that new program
"treatments" are quickly diffused to non-participating groups (schools,
etc.) (Coulson, 1978), and that the effects that are discerned depend
on the characteristics of the program processes "as implemented"
rather than on the ascribed program characteristics (Cook, 1981;
Cronbach, 1978; Cronbach & Associates, 1980; Rogosa, 1978). Under such
conditions, only those evaluation activities that delve beneath the
surface descriptions of programs can be expected to generate quality
information for policy formation.

Methodological improvements. Clearly, the impetus for change in
the conduct of large-scale educational evaluation exists. The philosophical,
theoretical, and political bases for the changes have been
and are being articulated. Under such conditions, the climate for
evaluation in the 1980's is quite open to new designs and strategies
for evaluating the effects of educational programs. The task of defining
these designs and strategies and illustrating their worth remains.

Fortunately, it is unnecessary to begin from scratch in the design
of large-scale evaluations for the 1980's. While actual educational
evaluations over the past decade, for the most part, utilized pre-1970's
technology (quantitative methodology, psychometric methods), investments
of resources in basic research on methodology and measurement during
the 1970's led to substantial improvements in the state of the art.



The relatively unsophisticated applications of experimental,
quasi-experimental, and non-experimental methods that led to the findings of
the Coleman report (Coleman et al., 1966) and of early Head Start and
Follow Through evaluations need not be repeated. Better, more sensitive
quantitative methodology is now available and is more suited to
the shift in emphasis in large-scale evaluations.

The same can be said for the measurement of program outcomes and
processes. Approaches for developing program-sensitive test instruments
as well as a broader view of the range of program outcomes are
currently on the evaluation agenda. The investment of resources to
obtain more intensive and descriptive measures of program implementation
and processes appears to be a standard feature of recent large-scale
evaluations (e.g., the Title I Parent Involvement Evaluation
conducted by System Development Corporation). These measurement
strategies should facilitate more useful evaluations. Better methods of
knowledge and data synthesis (e.g., recent work by Glass and Light)
should also contribute to better evaluations.

Basis for Methodological Improvements

The special issue of the Journal of Educational Statistics on the
Emergency School Assistance Act (ESAA) Evaluation (JES, 1978; see
especially Rogosa, 1978) and Cronbach's report on designing educational
evaluation (1978) provide documentation of key evaluation methodology
issues and help to motivate our general concerns. The basis for our
investigation into evaluation methodology is in part the following set
of general premises:



(1) Evaluation is inevitably an empirical enterprise, "examining events
in sites where the program is tried and the reactions and subsequent
performance of the persons served (as such it) is typically
identified with the application of social science methods:
observation, measurement, and/or use of informants." (Cronbach,
1978, pp. 25-26);

(2) "The success of an evaluation effort should be measured by its
social usefulness or utility .... Technical decisions should not
be made independently of the political and social context of an
evaluation. The central question is: How can we design, analyze
and report evaluations so as to make them maximally useful?"
(Rogosa, 1978, p. 80, emphasis added).

(3) "Evaluators are unwise to collect data only on pretest and posttest
achievement measures, or conduct analyses that only determine the
statistical significance of the overall treatment effect. Additional
data on process, and on program realization, are essential for
adequate descriptions of programs operating in complex settings."
(Rogosa, 1980, p. 81).

(4) The analytical strategies in program evaluations should be adapted
to the substantive problems under investigation rather than adapting
the evaluation of program impact to fit the analytical methods.
Natural designs and analysis should evolve from the structure and
function of the program (Burstein, 1980).

(5) Program evaluation is typically carried out within a multilevel
educational context. Program activities occur in the groups (classrooms,
schools, etc.) to which an individual belongs. These groups
influence the thoughts, behaviors, and feelings of their members
(Burstein, 1980).

(6) Educational interventions are typically implemented within
on-going programs. They vary in "fit" with existing activities and
predilections and vary in duration. Interventions in social
settings are inherently dynamic activities.

There are more specific methodological corollaries to these general
premises:

(1) "No one level is uniquely responsible for the delivery of and
response to educational programs ... confining substantive
questions to any one level of analysis is unlikely to be a productive
research strategy" (Rogosa, 1978, p. 83). Thus, attempts to
answer questions about the effects of educational programs require
analyses at and within the levels of the educational
hierarchy (Burstein, 1980).

(2) Even when one starts with a controlled experiment with random
assignment, features of the experimental design break down through
processes of attrition, contamination, and differential penetration
of the treatment. Under such conditions, quasi-experimental forms
of adjustment and control are inevitably necessary, and thus should
be anticipated as part of the evaluation design.

(3) In the course of an educational program, students are members of
multiple groups (e.g., classes). The features of these group
contexts and the consistency of students' educational experiences
within them over time warrant consideration for dynamic modeling
of program experiences (Burstein, 1981; Tuma, Hannan, & Groeneveld,
1978; Rogosa, 1980).



(4) In field experiments with well-defined treatments, the variation
in the fidelity of program practices with teacher (school, etc.)
predilections and skills leads to a continuous range of program
processes. Under these conditions, modeling the intervention
as a dichotomous rather than a continuous event is an insufficient
approach for investigating program effects (Burstein, 1981; Cronbach,
1978; Rogosa, 1978).

(5) Even when random assignment occurs at some aggregate level (e.g.,
school), the variation in the treatment effects for students within
aggregates needs to be investigated, especially in terms of its
consequences for the equalization of educational opportunity.

(6) Programs have multiple effects. Multiple measurement is needed to
encompass intended and unintended effects (desirable or undesirable)
(Cronbach, 1978, p. 26).

Fortunately, one can point to specific bodies of methodological work
that are responsive to both the general perspectives and the accompanying
methodological corollaries. In the following sections we will elaborate
the connections for a selected set of methodological strategies.

Examination of Specific Analytical Developments

The analytical methods to be examined represent broad areas of
methodological concerns that first developed within social science research
in general. To understand why this is both an obvious and proper starting
point, one need only consider the criteria used to delineate our relevant
universe of large-scale program evaluation. In particular we are interested
in design and analytical problems in evaluations that fit the following
description:



(1) The evaluation should have been conducted on a distinct funded
educational program(s) rather than be a general shift in the behaviors of
an educational system. There must have been some form of intervention,
innovation, or change in the ongoing educational program.

(2) The evaluation must have involved multiple sites of each presumably
distinct program type.

(3) The program must have been implemented (i.e., the main program
activities must operate) at the level of the school or lower.

(4) Both outcome and program process data must have been collected during
the course of the evaluation.

(5) Outcome data must be available over multiple time points.

(6) Good documentation of the original evaluation must exist.

The above delimiters eliminate evaluations which are short-term efforts,
have a limited number of sites, or are of programs presumably constant over
all schools in a district. These criteria include evaluations of
well-defined program interventions such as is provided by a specific Head
Start or Follow Through model, interventions that are less specific in program
prescription but nonetheless are assigned to "sites" in a systematic manner
such as by random assignment (e.g., the ESAA Evaluation), and more pervasive
social interventions where participants are essentially all persons with a
prescribed set of characteristics (e.g., Title I, Bilingual Education).

To gain a better perspective on the kind of study situation envisioned,
consider the following modified version of the conceptual framework for
investigating the impact of educational reforms outlined in Burstein (1981).
One starts by identifying the specific elements of educational and social
systems in which programs are introduced and the processes and outcomes
that result. The elements are the characteristics and attributes of individual
students, families, groups of students, teachers, classes, groups of teachers,
schools, and communities. The processes are developmental, instructional,
curricular, psychological, interpersonal, and social. Both elements and
processes can take on either static or dynamic properties though the latter
are more likely in school settings, especially those with large numbers of
poor children participating in school reform programs.

A general model containing the essential elements and processes
of the conceptual framework is as follows. The interrelations among
five distinct classes of variables are incorporated in the model:
program, instruction, schooling context (class, school, community, etc.),
student entering characteristics, and student performance.
Each class may represent many distinct variables (or sets of variables).
For example, "instruction" refers to the various characteristics of the
instruction a student receives in a specific classroom or school.
Particular teacher attributes (e.g., warmth, enthusiasm, clarity of
presentation) and instructional processes (e.g., structure, grouping, pacing,
types of reinforcements, teachers' questioning behavior, quality and variety
of instructional materials) both fit under the instruction rubric. Certain
aspects of the instructional practices also provide evidence about the
degree of program implementation. Nonetheless, any measure of program
implementation would still fall within the "instruction" category for
present purposes.

The term student "performance" is meant in the broad sense; the full
range of educational, social, and psychological outcomes fit under this
general rubric. The restriction to student outcomes could be broadened to
include other units (teachers, classes, schools), but not without making the
task of generating the framework even more unwieldy than it will appear
here.

The role of schooling context in the model is multifaceted. Its most
proximal manifestations are in the classroom where the program is
implemented. For example, the overall level and heterogeneity of ability in a
class places constraints on instructional content, organization, and
management. The consequences of these constraints vary for different reform
programs. Class heterogeneity places a strain on time and resources in
individually prescribed educational programs. Decisions about the pacing
of instruction become more difficult in programs emphasizing large group
instruction.

The student's role within the classroom is also directly influenced
by its composition (Burstein, 1980b; Firebaugh, 1980; Webb, 1980). There is
obviously a complicated balance between having classmates compatible in
ability and temperament versus having peers that are more or less able and/or
have contrasting personalities. Either combination might foster
intellectual, social, and psychological growth under the "right" conditions.
Here, again, programs with different emphases and organization might interact
differentially with class composition, making a given student's role more
comfortable or stressful.

There are also other elements of context provided by the class, school
and community environment for the program. Sirotnik and Oakes (1981)
provide a particularly comprehensive discussion of the possible components
of schooling context.

The pattern of relationships depicted in Figure 1 includes the following:

(1) Students are eligible for the program and are selected on the basis of
entering characteristics.

(2) Student entering characteristics (ability, "preferred learning style",
motivation to learn, "preparation for learning") affect performance
at any point in time.

(3) Entering characteristics interact with program characteristics to
give certain students relative advantages in certain programs (e.g.,
low ability students benefit from relatively higher levels of teacher
control and direction for language and mathematics mechanics).

(4) Programs interact with school personnel characteristics (preferred style,
personality, authority relationships, cohesiveness).

(5) Schooling context (ability distribution, personality, presence/absence
of demanding/disruptive students, orderliness at class or school level)
affects instruction (emphasis, amount of material covered, organization,
program delivery).

(6) Students' shared educational and social experiences in classrooms and
schools depend on student entering characteristics, instruction,
schooling context and program characteristics.

(7) Students from the same class in year 1 may be assigned to different
classes in year 2 or may leave the school.

(8) Students not present in year 1 may enter school (and thus program
classes) during year 2.

(9) Implementation of programs may differ for year 2 from year 1.

(10) Instructional (program) characteristics (e.g., teacher "style",
organization) may differ from year 1 to year 2, and the effect of instruction
(program) year 1 followed by instruction (program) year 2 is not
necessarily additive.

(11) Contextual characteristics may differ from year 1 to year 2.

(12) Conditions (1) - (5) hold for year 2 in similar fashion as for year 1.

(13) Program differs from "normal" standard instruction and may interact.
Though instruction of Type A may be better than instruction of Type B,
instruction of Type B might be better for students following participation
in the program than Type A would be.

The two areas of analytical developments to be discussed below become
relevant in a program of the type described above for several reasons. First,
eligibility for program participation typically depends on specific ascribed
characteristics (e.g., poverty, bilingualism, ethnicity). Even in nominally
"experimental" investigations, selection for participation may have
non-random aspects at some level, as in the case where the program is randomly
assigned to a sample of schools from a pool of volunteers. A further
complication is the non-stable participant sample; students enter and leave
classrooms, and teachers and schools drop out of programs for various reasons.

A second feature requiring analytical attention is the sheer number of
elements that potentially enter a comprehensive picture of program processes
and outcomes, the complexity of their interrelation, and the inherent
problems in measuring key variables by the kinds of questionnaire, interview,
observation and test data typically used. All of the elements of model
specification, from a clear understanding of the question of interest through
identification and operationalization to appropriate analyses and
interpretation, have a bearing on the fidelity of the evaluation conclusions
to the program's actual consequences.

To a certain degree, these features align with the two analytical
developments to be considered below.

Non-Equivalent Control Group Designs/Selection Modeling

From the inception of the large-scale educational evaluation efforts
of the 1960's, evaluators have tried to employ the paradigm for
experimentation in field investigations. With rare exception, however
(see Boruch, 1974), investigators quickly found themselves in the midst of
non-experimental or at best quasi-experimental studies wherein all the best
intentions about random assignment went unfulfilled.

From a methodological perspective, consciousness about the inadequacy
of analytical methods in these investigations can be traced back to Campbell
and Erlebacher's (1970) lament (perhaps complaint is the better term) that
regression artifacts in quasi-experimental evaluations were causing
compensatory education to look harmful. While certain aspects of the original
Campbell-Erlebacher critique have been found to be less generally applicable
than originally believed, the design constraints that bothered them remain
at the center of current analytical concerns.

Basic analytical issues. Reichardt's (1979) and Barnow, Cain and
Goldberger's (1980) discussions of the problems in analyzing non-equivalent
control group designs are a particularly helpful starting point for our
examination. As Reichardt points out, the main issue is the effect of
uncontrolled selection on the estimation of program effects. When subjects
are randomly assigned to programs (or non-program), groups can be considered
initially equivalent, though the equivalence can be vitiated if there is
differential attrition. Without random assignment, program groups would
not be expected to be equal even in the absence of a program effect. Thus,
in order to "equate" non-equivalent groups, it is necessary to adjust or
control for initial differences.

The analyst at this juncture invariably recognizes that the task at
hand is to (a) identify the selection process underlying group membership
(program, non-program) and (b) include the variables that determine
selection in the analysis of program effects. Ideally, this analytical
strategy would control for the effects of initial differences.

Until recently the statistical method typically employed by analysts of
quasi-experiments was the analysis of covariance (ANCOVA), which is
essentially a linear regression of program outcomes, Y, on program status, Z
(e.g., 1 = in program, 0 = not in program), and pre-program true ability, W.
Thus the "ideal" analytical model is represented by (1) below:

$$Y = aZ + W + e \tag{1}$$

where a is the effect of program status, Z, and W is the covariance
adjustment for true initial differences.

But as is well known, W is unobservable. Under these conditions Barnow,
Cain and Goldberger (1980) ask, "How may the evaluator persuade an interested
audience that the measured effect of Z on Y is free of any contamination
from a correlation between Z and W, given that W is not available as an
explanatory variable?" (p. 47). Their answer to their own question is
that "unbiasedness is attainable when the variables that determine treatment
assignment are known, quantified and included in the equation" (Barnow et
al., 1980, p. 47; see also Barnow, 1975; Cain, 1975; and Goldberger, 1972).
Thus if one has an observed variable t that was used to determine group
assignment (in general t will be a score based on a composite of variables,
some of which may be correlates of W), then t may be used to replace W as
the explanatory variable in (1):

    $Y = \beta_1 Z + \beta_2 t + e^{*}$ .    (2)

Under conditions to be specified, $\beta_1$ in equation (2) would be an unbiased
estimate of the program effect $\alpha$. Thus either W or t will remove the
contamination which leads to "selectivity bias".

But the question arises whether the selection process can be
known precisely; often one is unable to quantify t. In this case,
investigators have settled for a set of variables, X, that serve as proxies for W.
The X's may also include variables which enter t. The equation to be
estimated is then

    $Y = \gamma_1 Z + \gamma_2 X + e^{**}$ .    (3)

Equation (3) is essentially the standard ANCOVA model as employed
in the analysis of quasi-experimental data. Unfortunately, an estimate of
$\gamma_1$ will in general be a biased estimate of the true program effect $\alpha$.
Statistically, this bias depends on the covariance of Z and W conditional on
X. Moreover, contrary to Campbell and Erlebacher's (1970) assertion, the
bias may be either positive or negative. Investigations by Goldberger
(1972), Barnow (1973), Cain (1975), Cronbach, Rogosa, Floden, and Price
(1977) and Bryk and Weisberg (1977) clearly demonstrate this property.
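
The dependence of the bias on the conditional covariance can be made explicit
with a short derivation. The sketch below assumes that W can be written as a
linear projection on Z and X and that the disturbance e is uncorrelated with
both; the symbols $\delta_Z$, $\delta_X$ and $u$ are introduced here only for
illustration and do not come from the sources cited.

```latex
% Assumed linear projection of true ability on program status and the proxies
W = \delta_Z Z + \delta_X X + u, \qquad \mathrm{Cov}(u, Z) = \mathrm{Cov}(u, X) = 0 .
% Substituting into equation (1), Y = \alpha Z + W + e:
Y = (\alpha + \delta_Z) Z + \delta_X X + (u + e) ,
% so the coefficient of Z in equation (3) is
\gamma_1 = \alpha + \delta_Z ,
\qquad
\delta_Z = \frac{\mathrm{Cov}(Z, W \mid X)}{\mathrm{Var}(Z \mid X)} .
```

Here the conditional covariance and variance are those remaining after linear
adjustment for X. The bias $\delta_Z$ takes the sign of the conditional
covariance, which is why it can be positive or negative, and it vanishes only
when Z and W are unrelated once X is controlled.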

To better understand the ramifications of the inability to observe
true preprogram ability (W) and/or to accurately quantify the selection
process (t), we consider the sources of bias in the estimation of program
effects when the ANCOVA model is employed with nonequivalent groups.
Reichardt (1980) discusses seven sources, most of which are pertinent to
this inquiry.

The problems due to errors in measuring the covariates (the X's in
equation (3)) are the most frequently examined source of bias. Even when
measurement errors are random, they lead to attenuated estimates of covariate
effects and thus result in an underadjustment for pre-existing differences
between different programs. The errors in the covariate cause the treatment
effect estimate from ANCOVA to converge toward the estimate from an ANOVA,
which completely ignores pre-existing group differences.
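
The underadjustment can be illustrated with a small simulation. The sketch
below is not taken from the studies cited. It assumes a hypothetical setup in
which the program group starts one unit lower on true ability W, the true
program effect is zero, and the only covariate is W measured with random
error of a chosen reliability. All function names and parameter values are
illustrative assumptions.

```python
import random

def group_mean(values, z, group):
    """Mean of `values` for the observations whose program status equals `group`."""
    picked = [v for v, zi in zip(values, z) if zi == group]
    return sum(picked) / len(picked)

def anova_and_ancova(y, z, x):
    """Return (unadjusted mean difference, ANCOVA-adjusted difference)."""
    my = {g: group_mean(y, z, g) for g in (0, 1)}
    mx = {g: group_mean(x, z, g) for g in (0, 1)}
    sxy = sum((xi - mx[zi]) * (yi - my[zi]) for xi, yi, zi in zip(x, y, z))
    sxx = sum((xi - mx[zi]) ** 2 for xi, zi in zip(x, z))
    slope = sxy / sxx                       # pooled within-group slope of Y on X
    raw = my[1] - my[0]
    return raw, raw - slope * (mx[1] - mx[0])

def simulate(reliability, n=20000, alpha=0.0, seed=1):
    """Simulate a quasi-experiment with a fallible covariate (assumed design)."""
    rng = random.Random(seed)
    # Within-group variance of W is 1, so this error SD gives the stated reliability.
    error_sd = ((1.0 - reliability) / reliability) ** 0.5
    y, z, x = [], [], []
    for _ in range(n):
        zi = 1 if rng.random() < 0.5 else 0
        w = rng.gauss(-1.0 if zi else 0.0, 1.0)   # program group starts lower
        x.append(w + rng.gauss(0.0, error_sd))    # observed, fallible covariate
        y.append(alpha * zi + w + rng.gauss(0.0, 0.5))
        z.append(zi)
    return anova_and_ancova(y, z, x)

for rel in (1.0, 0.8, 0.5, 0.2):
    raw, adjusted = simulate(rel)
    print(f"reliability {rel:.1f}: ANOVA {raw:+.2f}, ANCOVA {adjusted:+.2f}")
```

Under these assumptions the pooled within-group slope shrinks toward the
reliability, so in large samples the adjusted difference approaches
-(1 - reliability): near zero for a perfectly reliable covariate and near the
unadjusted ANOVA difference of -1 as reliability falls. This is the pattern
that can make a program with no effect look harmful.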

The second source of bias in ANCOVA is the possibility of differential
growth rates among identifiable subpopulations under conditions where
subpopulation membership is related to program assignment. Though individuals
from different subpopulations may be the same initially, their later
differences may be attributed to differences in maturation. In this case,
growth invalidates ANCOVA because within-group growth does not completely
account for between-group differences in growth.

According to Reichardt, related sources of bias due to changes between
the time of program entry and measurement of program outcomes which are
irrelevant to the treatment are trait instability and the changing structure
of behavior. Trait instability refers to differential variability
(fluctuation) in scores over time as opposed to mean differences. The
changing structure of behavior refers to the possibility that the processes that
account for given naturally occurring behaviors vary over time, with different
characteristics and processes becoming disproportionately important at
various times. (Cronbach et al. (1977) discuss this source in some detail.)

Other complications identified by Reichardt include (a) operationally
unique pretests and posttests (i.e., even though the measures of initial
status and final performance are nominally the same, they are operationally
distinct, as different abilities and skills are tapped at different points
in time); and (b) non-linear regression lines (not properly incorporated in the
model) and non-parallel regression lines (due to treatment interaction
effects, floor and ceiling effects, differential growth between groups,
or between-group differences in the reliability of the covariates).

Reichardt (1980) describes four approaches for ruling out selection
differences as a rival explanation for program effects. The first three
(namely, developing a causal model of the posttest, developing a causal model
of the assignment process, and the Cronbach et al. (1977) combination of the
two approaches) are basically elaborations on the identification of W,
t, or both as described earlier. One essentially adopts a broader,
theoretically grounded and empirically estimated model of how posttest behavior
is expected to vary in the absence of the program (modeling the posttest;
Cronbach et al. call this identifying the "ideal covariate"), of how individuals
are assigned to "treatment" groups (modeling the assignment process, or
identifying the "complete discriminant" in Cronbach et al.'s terminology),
or of both. After determining a specific approach, there are still questions
about the appropriate analytical machinery to adjust for measurement errors
and estimate W and t appropriately. The sheer complexity of the adjustment
has led some investigators to recommend the use of procedures derived from
the work of Joreskog (1970, 1973, 1974, 1977; Joreskog and Sorbom, 1976,
1978) for the analysis of covariance structures. These methods attempt to
correct simultaneously for the effects of measurement error and irrelevance
in multiple covariates. We defer further discussion of these techniques
to the next major section of our report.



Value-added analysis. The fourth approach discussed by Reichardt
(1980) is the modeling of change or growth. Promising work on this topic
has been carried out by Bryk and Weisberg (Bryk, 1977; Bryk and
Weisberg, 1976; Bryk, Strenio, and Weisberg, 1980; Strenio, 1977; Weisberg,
1978). They introduced a variety of analytical methods for estimating the
"value-added" by program participation. Their value-added analysis is built
upon the notion that educational programs are dynamic interventions in
natural growth processes. Thus Bryk and Weisberg first modeled natural
growth processes and then assessed program impact on the processes.

The basic idea underlying the Bryk-Weisberg value-added procedure is to
compare average observed growth between pretest and posttest with an estimate
of the amount expected in the absence of an intervention.
To employ their techniques, one needs to have pretest ($Y_{1i}$) and posttest
($Y_{2i}$) data on a sample of individuals, as well as the times (calendar dates
$t_1$ and $t_2$) at which observations were obtained and the ages ($a_{i1}$, $a_{i2}$) of
each individual at these times. In the more general case, one would also
obtain information on other background variables ($X_{ij}$). Their methods
also seem to be applicable whether treatment is represented by a discrete
group membership variable (treatment A vs. treatment B) or by a set of
variables describing program and instructional differences (e.g., explicit
characteristics of instruction, schooling context, and program
implementation).

Bryk and Weisberg's general model can then be expressed as

    $Y_i(t) = G_i(t) + R_i(t)$ ,    (4)

    $G_i(t) = \pi_i a_i(t) + \delta_i$ ,    (5)

    $\pi_i = \sum_j \beta_j X_{ij} + \varepsilon_i$ .    (6)

In (4) above, $G_i(t)$ and $R_i(t)$ represent systematic growth and random
components, respectively; $\pi_i$ and $\delta_i$ are the slopes and intercepts of
individual growth curves, and $a_i(t)$ is the age of individual i at time t. The
$X_{ij}$ are the values of the jth background variable for subject i, the $\beta_j$
are the corresponding coefficients, and the $\varepsilon_i$ are unmeasured determinants
of individual growth rates. Given one of several choices of assumptions about error
structure (e.g., $E[R_i(t)] = 0$; $\mathrm{Var}(R_i(t)) = \sigma^2$, constant over all
subjects and times; $R_i(t)$ independent of $t$, $\pi_i$, $\delta_i$, and any other $R_j$;
$E(\varepsilon_i | X_i) = 0$; $\mathrm{Var}(\varepsilon_i | X_i) = \sigma_\varepsilon^2$; and
$\mathrm{Cov}(\varepsilon_i, X_i) = 0$), one then estimates the value-added
by first regressing pretest on age and its interactions with background
variables to determine estimates of individual growth rates ($\hat{\pi}_i$)
and then calculates a value-added for an individual using the expression

    $V_i = Y_i(t_2) - Y_i(t_1) - \hat{\pi}_i \Delta_i$ ,    (7)

where $\Delta_i$ represents the time interval between pretest and posttest. The
average of the individual values added,

    $\bar{V} = \frac{1}{n} \sum_{i=1}^{n} V_i$ ,    (8)

is then an estimate of program impact.
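
The arithmetic of equations (7) and (8) can be sketched in a few lines. The
example below is a deliberately simplified illustration, not Bryk and
Weisberg's procedure in full. It assumes no background variables, so a single
growth rate estimated from the cross-sectional regression of pretest on age
is applied to everyone, and it uses made-up, noise-free scores so that the
result can be checked by hand. The function names are hypothetical.

```python
def growth_rate(ages, pretest):
    """Slope of the least-squares regression of pretest score on age at pretest."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_pre = sum(pretest) / n
    sxy = sum((a - mean_age) * (y - mean_pre) for a, y in zip(ages, pretest))
    sxx = sum((a - mean_age) ** 2 for a in ages)
    return sxy / sxx

def value_added(ages, pretest, posttest, intervals):
    """Individual V_i = Y_i(t2) - Y_i(t1) - pi_hat * Delta_i and their mean."""
    pi_hat = growth_rate(ages, pretest)   # common growth rate (simplifying assumption)
    v = [y2 - y1 - pi_hat * d for y1, y2, d in zip(pretest, posttest, intervals)]
    return v, sum(v) / len(v)

# Toy data (assumed): ages in months, natural growth of 0.5 points per month.
ages = [60, 66, 72, 78, 84]
pretest = [10 + 0.5 * a for a in ages]
intervals = [8] * len(ages)               # months between pretest and posttest
# Posttest = pretest + natural growth over the interval + 3 points from the program.
posttest = [y + 0.5 * d + 3.0 for y, d in zip(pretest, intervals)]

v, v_bar = value_added(ages, pretest, posttest, intervals)
print("estimated growth rate:", growth_rate(ages, pretest))
print("estimated program impact:", v_bar)
```

With real data the growth rate would be estimated with error and would differ
across individuals through the background variables, which is where the
concerns about precision in the two-stage procedure arise.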

Bryk and Weisberg's procedures appear seductively simple and broadly
applicable. One models the growth process as best one can from relevant
background variables and the time span over which the program measurements
are obtained, then attributes the remaining average increment in performance
to the program. Their most recent article (Bryk et al., 1980) discusses
extensions of the basic value-added analysis model to cases where errors in
regression models are heteroscedastic, growth is non-linear, comparison
group data are available, programs are administered to non-randomly
formed groups of individuals, and aptitude-treatment interactions
are believed to exist.

Important limitations of the value-added procedure are also indicated
by Bryk et al. (1980). The problem of a shifting metric for measuring
growth over time cannot be alleviated through value-added procedures.
Whether it is simply a matter of the restandardization of scores at different
age and grade levels or the more serious (analytically, at least)
concern that the component skills accentuated at different ages vary, the
basic complication falls outside the purview of a modeling procedure of
this type.

Another limitation is the inability of the lone value-added model to
deal with the lack of monotonicity of growth that occurs in schooling
data with multiple years of schooling separated by summer vacations. In
our companion report (Miller, 1981), a rudimentary example of this
non-monotonicity arises in the Beginning Teacher Evaluation Study (BTES)
data. Maddahian (1981) showed that this occurred for other BTES measures, and
others (e.g., Klibanoff & Haggart, 1980) have uncovered similar examples
in other evaluation studies. It is not inherently impossible to apply
the value-added approach to more complex growth models; it is just unclear
at present how one converges substantively on an adequate model for these
more complex dynamic processes.

There is no mention in the Bryk-Weisberg work of how the investigator
is to alleviate the problem of measurement errors in explanatory variables.
While the concentration on a single-group model (no comparison group)
seemingly removes the concerns about differential attenuation of estimates,
the two-stage estimation process (estimate growth from the pretest and predict
growth increments to subtract from the posttest) would appear to place greater
demands on precise estimation, demands not likely to be met by the current
value-added approach. In principle the model should work best during periods
when individuals are experiencing substantial observed growth, which suggests
that the technique is most suitable for the study of programs for younger
children. But outcome measures are notoriously less reliable and stable
during the preschool years and early grades of formal schooling than in
later years.

Similarly, from a modern perspective, it is advantageous to be able to
model program processes and examine their effects directly rather than rely
simply on program participation as the indicator of program effects. As
Bryk et al. (1980) demonstrate, the value-added approach can be used to
estimate the effects of program characteristics on program outcomes (i.e.,
the value-added for a given site). Yet here, too, errors in measuring
program process characteristics (as opposed to, say, ascribed individual
and program characteristics) mean that the measures are likely to reflect the
true state of affairs inadequately.

Finally, there is no provision in the current literature on the
value-added approach to deal with multiple measures of growth. Presumably,
analysts must choose some means of arriving at a single growth measure
(e.g., some form of composite) before proceeding with the value-added
analysis. The alternative is to generate a series of value-added estimates,
one for each combination of pre- and posttests. Our sense is that the
former will typically be less than satisfactory because of the changing
character of the ideal composite over time. The latter quickly becomes
unwieldy unless a reasonable scheme for interpreting the pattern of effects
can be determined (e.g., see Weisberg, 1978).

In conclusion, we judge the value-added approach to be a useful
addition to the complement of analytical strategies for evaluating program
consequences. However, the biases associated with measurement errors,
changing metrics and the changing structure of behavior linger and may, in
certain respects, be exacerbated. Nor is the problem of multiple outcome
measures adequately considered. Nonetheless, if investigators do choose to
employ the multiple analysis strategies perspective advocated here, the
value-added approach will be a wise choice for inclusion in a broad
range of evaluation situations.

Selection modeling. Another recently developed set of analytical
approaches for dealing with selection bias can be traced to evaluations
of social experiments on welfare reform (Rossi & Lyall, 1976; Stromsdorfer
& Farkas, 1980). Economists working on these evaluations developed
methods for adjusting for selection effects in estimating the effects of
interventions. Volume 5 of the Evaluation Studies Review Annual
(Stromsdorfer & Farkas, 1980) is the most comprehensive published source
on selection modeling methods. Representative papers from several of
the major contributors (e.g., Hausman, Heckman, Goldberger) are included
along with useful discussions of the issues by the editors (Stromsdorfer
& Farkas, 1980) and by Barnow, Cain, and Goldberger (1980). However,
this work is developing rapidly, and even recent synthetic reviews by
Muthen (Muthen, 1981; Muthen & Joreskog, 1981) cannot keep up with the
latest technical nuances. In addition, a whole set of seemingly related
techniques developed by sociologists (e.g., Tuma & Hannan, 1978; Tuma,
Hannan, & Groenveld, 1978) for dynamic modeling with panel data is
not even considered by the economists.

We will not attempt to describe all the particular analytical
developments in our discussion of selection modeling. Instead, we try
to indicate the ways in which the methods are designed to alleviate specific
problems in the analysis of quasi-experimental data, point out the broad
categories of analytical approaches that are currently available, and
attempt to pinpoint the set of problems left unresolved by these methods.
And, although we find the methods of Tuma and Hannan potentially valuable
for longitudinal evaluations of social programs, the discussion will
concentrate on the econometric work.

The general problem that motivates the selection modeling work is
the selectivity bias that results when individuals (or, for that matter,
aggregates of individuals such as schools) are self-selected (non-randomly
selected) into experimental and control groups (or into different program
types) or when data on the study sample are non-randomly missing (see
our earlier discussion of work by psychologists on this topic, i.e., work
reviewed by Reichardt, 1979). According to Stromsdorfer and Farkas (1980), "the
realization that the difficulties associated with self-selection, censored
samples (where some variables are unmeasured for certain individuals in
the sample), truncated samples (where all variables are unmeasured for
certain individuals who should be in the sample), and limited dependent
variables (variables restricted to some subset of values: for example,
weeks worked, which must be zero or above or the probability of being
employed, which must lie between zero and one) all have a common foundation"
(p. 14) was perhaps the most important statistical development in social
science methodology during the 1970's. This realization led investigators
to develop methods for incorporating analytical procedures for handling
self-selection, censored and truncated samples, and limited dependent
variables within the general analytical model for estimating program
effects.

The general analytical procedures involved in econometric selection
modeling can be sketched as follows. (This discussion draws heavily
from Barnow, Cain, and Goldberger (1980), Goldberger (1979), and Muthen
and Joreskog (1981).) Because of non-random assignment to program it is
necessary to incorporate information about the selection process into
the equation for estimating program effects. Thus, equation (3) for
program outcomes,

    $Y = \gamma_1 Z + \gamma_2 X + e^{**}$    (3)

(recall that Z represents program status; Z = 1 for program participants and
Z = 0 for the comparison group), needs to be supplemented by an equation for
selection into the program. A selection equation with Z as the dependent
variable is specified and restrictions are placed on it to remove pre-existing
differences between program and comparison groups from the estimates of
the treatment effect ($\gamma_1$ in (3)). The restrictions on the selection
equation appear to be of two types. First, there must be variables that
determine selection that do not affect outcome. Thus, there must be
variables necessary to account for Z that are not among the X's from
equation (3). Second, the functional form of the relation between X and
W (true ability as identified in equation (1)) and a non-linear relation
between Z and X are specified. This leads to a non-linear functional
form of X in the outcome equation that is necessary to control for any
relationship between Z and W that is not controlled by X.

In more formal terms we begin with three observable variables (Y,
X, Z), two unobservable variables (W and t, the true selection variable;
these two are analogous in many respects to Cronbach et al.'s ideal
covariate and complete discriminant) and various disturbances for the
equations. Then

    $Z = 1$ if $t > 0$;  $Z = 0$ if $t \le 0$ ,    (9)

and, as stated earlier, program outcomes are determined by

    $Y = W + \alpha Z + \varepsilon_0$ ,    (1)

where $\varepsilon_0$ (e in the original version of equation (1)) is normally
distributed, independent of W and Z, and has expectation zero and standard
deviation $\sigma_0$. The relations among X, W, and t prior to selection and
program participation are given by

    $W = \theta_1' X + \varepsilon_1$ ,    (10)

    $t = \theta_2' X + \varepsilon_2$ ,    (11)

where $\theta_1$ and $\theta_2$ are coefficients relating X to W and t, and the
disturbances $\varepsilon_1$ and $\varepsilon_2$ are bivariate normal, uncorrelated with X
and $\varepsilon_0$, and have standard deviations $\sigma_1$ and $\sigma_2$ and covariance
$\sigma_{12}$. Thus, W and t may be related via X or through correlated
disturbances. Substituting from (10) into (1) yields

    $Y = \theta_1' X + \alpha Z + \varepsilon_3$ ,    (12)

where $\varepsilon_3 = \varepsilon_1 + \varepsilon_0$, and $\varepsilon_2$ and $\varepsilon_3$ are bivariate normal,
etc., with covariance $\sigma_{23} = \sigma_{12}$. (Note that equations (12) and (3)
are the same except for assumptions about the disturbance.) Turning next to
the selection equation, we see that Z = 1 is equivalent to
$\theta_2' X + \varepsilon_2 > 0$, which in turn implies $\varepsilon_2 > -\theta_2' X$
and $\varepsilon_2/\sigma_2 > -\theta' X$, where $\theta = \theta_2/\sigma_2$. But $\varepsilon_2/\sigma_2$
is a standard normal variable independent of X. And since Z is binary it
follows that

    $E(Z|X) = \mathrm{Prob}(Z=1|X) = 1 - F(-\theta' X) = F(\theta' X)$ ,    (13)

where $F(\cdot)$ is the standard normal cumulative distribution function.
Furthermore,

    $E(\varepsilon_2/\sigma_2 | X, Z=1) = f(\theta' X)/F(\theta' X)$    (14a)

and

    $E(\varepsilon_2/\sigma_2 | X, Z=0) = -f(\theta' X)/(1 - F(\theta' X))$ ,    (14b)

where $f(\cdot)$ denotes the standard normal density function. Equations
(14a) and (14b) can be rewritten in combined form and rearranged to give

    $E(\varepsilon_2/\sigma_2 | X, Z) = \frac{f(\theta' X)(Z - F(\theta' X))}{(1 - F(\theta' X)) F(\theta' X)} = h(X, Z; \theta)$ ,    (15)

or, equivalently,

    $E(\varepsilon_2 | X, Z) = \sigma_2 h(X, Z; \theta)$ .

Also,

    $E(\varepsilon_3 | X, Z) = (\sigma_{12}/\sigma_2^2) E(\varepsilon_2 | X, Z) = (\sigma_{12}/\sigma_2) h(X, Z; \theta)$ .    (16)

Given (16), the expectation of (12) conditional on X and Z is then

    $E(Y | X, Z) = \theta_1' X + \alpha Z + (\sigma_{12}/\sigma_2) h(X, Z; \theta)$ .    (17)

Equation (17) is the conditional expectation function relating observable
values, and its parameters ($\theta_1$, $\alpha$, $\sigma_{12}/\sigma_2$, and
$\theta = \theta_2/\sigma_2$) can be estimated by non-linear least squares. The
crucial feature of this expression is the inclusion of $h(X, Z; \theta)$, which
takes the conditional relationship between X and Z into account, thus removing
a source of bias (omission of a variable) in estimating $\alpha$, the program
effect.
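
As a quick numerical check on equations (14a), (14b) and (15), the correction
term h can be computed directly from the standard normal density and
distribution function. This is a minimal sketch using only Python's standard
library; the function name and the example index value are assumptions made
for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def h(index, z):
    """Correction term h(X, Z; theta) of equation (15), with index = theta'X."""
    f = STD_NORMAL.pdf(index)
    F = STD_NORMAL.cdf(index)
    return f * (z - F) / ((1.0 - F) * F)

index = 0.3                       # an arbitrary illustrative value of theta'X
F = STD_NORMAL.cdf(index)         # Prob(Z = 1 | X), equation (13)
h1, h0 = h(index, 1), h(index, 0)
print("conditional mean for participants:    ", h1)   # f / F
print("conditional mean for non-participants:", h0)   # -f / (1 - F)
print("average over Z:", F * h1 + (1.0 - F) * h0)
```

The last line averages the two conditional means with weights
Prob(Z = 1 | X) and Prob(Z = 0 | X). It is zero up to rounding, as it must be
for a disturbance whose mean is zero given X.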

In practice (17) is estimated by a two-step procedure (Heckman,
1976) whereby $\theta$ ($= \theta_2/\sigma_2$) is estimated by maximum-likelihood probit
analysis of Z on X, these estimates are inserted in (15) to estimate
$h = h(X, Z; \theta)$ for each observation, and then $\theta_1$, $\alpha$, and
$\sigma_{12}/\sigma_2$ are estimated by linear least-squares regression of Y on X, Z,
and h. There is an alternative estimation procedure attributed to Maddala and
Lee (1976) that operates in a similar fashion.
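
The two steps can be sketched as follows. This is a minimal,
standard-library-only illustration on simulated data, not the estimation
software used in the studies cited. It assumes the first type of restriction
described earlier: a hypothetical variable q enters the selection equation but
is excluded from the outcome equation. The parameter values (alpha = 1,
theta = (0, 0.5, 1), sigma_12/sigma_2 = 0.6) and all function names are
assumptions made for the example, and Fisher scoring is used as one
convenient way to obtain the probit maximum-likelihood estimates.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def h(index, z):
    """Correction term of equation (15); the clamp guards against division by zero."""
    F = min(max(STD.cdf(index), 1e-10), 1.0 - 1e-10)
    return STD.pdf(index) * (z - F) / ((1.0 - F) * F)

def probit(rows, z, iterations=20):
    """Step 1: maximum-likelihood probit of Z on the selection variables."""
    k = len(rows[0])
    theta = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, zi in zip(rows, z):
            idx = dot(theta, r)
            F = min(max(STD.cdf(idx), 1e-10), 1.0 - 1e-10)
            weight = STD.pdf(idx) ** 2 / (F * (1.0 - F))
            hi = h(idx, zi)                  # per-observation score is h times x
            for i in range(k):
                score[i] += hi * r[i]
                for j in range(k):
                    info[i][j] += weight * r[i] * r[j]
        theta = [t + s for t, s in zip(theta, solve(info, score))]
    return theta

def two_step(sel_rows, out_rows, y, z):
    """Step 2: regress Y on the outcome variables plus the estimated h."""
    theta = probit(sel_rows, z)
    h_hat = [h(dot(theta, r), zi) for r, zi in zip(sel_rows, z)]
    return theta, ols([r + [hv] for r, hv in zip(out_rows, h_hat)], y)

def simulate(n=5000, alpha=1.0, rho=0.6, seed=7):
    """Hypothetical data: x affects ability and selection; q affects selection only."""
    rng = random.Random(seed)
    sel_rows, out_rows, y, z = [], [], [], []
    for _ in range(n):
        x, q = rng.gauss(0, 1), rng.gauss(0, 1)
        e2 = rng.gauss(0, 1)
        e1 = rho * e2 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        zi = 1 if 0.5 * x + 1.0 * q + e2 > 0 else 0      # equations (9) and (11)
        w = 1.0 * x + e1                                  # equation (10)
        y.append(w + alpha * zi + rng.gauss(0, 0.5))      # equation (1)
        z.append(zi)
        sel_rows.append([1.0, x, q])
        out_rows.append([1.0, x, float(zi)])
    return sel_rows, out_rows, y, z

sel_rows, out_rows, y, z = simulate()
naive_alpha = ols(out_rows, y)[2]
theta_hat, coef = two_step(sel_rows, out_rows, y, z)
print("ANCOVA-style estimate of alpha:", round(naive_alpha, 2))
print("two-step estimate of alpha:    ", round(coef[2], 2))
print("estimated sigma_12 / sigma_2:  ", round(coef[3], 2))
```

In this setup the regression of Y on X and Z alone overstates the program
effect, because participants have higher unobserved ability, while adding the
estimated h should bring the estimate back near the assumed value of alpha.
Standard errors are omitted; the usual least-squares formulas do not apply
directly to the second step because h is itself estimated.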

The essential feature of the Heckman-Maddala-Lee procedures is
that they resolve the problem of selectivity bias by modifying the outcome
equation for presumed selection process effects. As in simple ANCOVA,
the adjustment is only necessary in those conditions where treatment
selection (Z) and true ability (W) are related after controlling for the
observed covariates (X). Thus, if there is no relationship between $\varepsilon_1$
and $\varepsilon_2$ ($\sigma_{12} = 0$), there is no bias introduced through selection, and the
more complicated selection modeling adjustments are unnecessary.

In their review, Barnow et al. (1980) cite a number of problems
with the selection modeling that require further attention:

(1) which consistent estimation procedure is best,

(2) how to deal with severe collinearity in the second-step
regression,

(3) the effect of non-normal disturbances on the robustness of
estimators,

(4) misspecification of the original model, and

(5) multiple selection rules.

Several of these problems have since been addressed to some degree (e.g.,
see Goldberger, 1980; Heckman, 1980; and Olsen, 1979 on the effects of
departures from normality).

Our reading of the current view (Muthen (1981) is the most recent
and comprehensive review we have seen) is that the consequences are quite
serious (i.e., the procedures fail to remove the selectivity bias) when
errors in the regression relation depart from normality and/or
homoscedasticity (e.g., Goldberger, 1980; Hurd, 1979; Olsen, 1979) and when the
functional form of the selection and/or outcome relations is misspecified.
The latter can take several forms. For example, it may be that the true
relationship of program and ability to outcome is nonlinear though the
specification includes only linear effects. Such a situation might suggest
the need for adjustments via selection modeling when a more appropriate
modification requires a shift to a new functional form for the
relationships.

The second form of specification problem, one that is likely to occur
quite frequently, arises when relevant variables are omitted from the
selectivity bias adjustment. In the Heckman procedures, this problem is
manifested by leaving out variables that should be incorporated in the probit
step. Again, the consequence is the failure to properly adjust estimates in
the outcome equation (Muthen, 1981, reviewing work (not currently available
for citation) by Cronbach and Goldberger).

Two other concerns raised earlier about other approaches to analysis
of quasi-experimental data warrant mention here. First, virtually all
of the econometric discussions of selection modeling focus on a single
outcome measure. Second, the possibility of measurement errors associated
with any of the observable variables (either Y's or X's) is not discussed.

Surely one would want to be able to deal with multiple outcomes and
with latent exogenous (explanatory) variables. At the least it would
be helpful to state the expressions for selection and outcome modeling
in terms of latent, rather than fallible observed, variables. Work by
Muthen, Joreskog, and Sorbom (Muthen & Joreskog, 1981; Sorbom, 1978,
1981; Sorbom & Joreskog, 1981) represents initial attempts at selection
modeling with latent exogenous variables. Essentially one first estimates
latent variables via LISREL and then applies the Heckman procedures
using the latent variables rather than the observed set of X's. Unfortunately,
these methods of estimating latent variables are currently restricted
to models with strictly continuous X variables because of their reliance
on maximum likelihood procedures that require multivariate normality.

The above concerns notwithstanding, the selection modeling
procedures developed by economists clearly offer improvements over the ANCOVA
methods described earlier. Though the demands for careful thinking
about selection mechanisms are severe, the rewards of such efforts are
often substantial, both analytically and substantively.

Summary. We have described in some detail both the basis for concerns
about bias in quasi-experimental studies and two sets of analytical
developments (the value-added approach and selection modeling) intended to
remove or adjust for bias. Both procedures are improvements over past practice
mainly because they employ explicit models of the phenomena believed
to be responsible for the difficulties in estimating program effects.
Both approaches are also adaptable to situations where there are no
specific comparison or control groups (instead the effects of specific
program features are to be estimated) and where panel data exist on
program participants.

Neither approach directly addresses such concerns as measurement
errors in the explanatory variables, changes in the scales of measurement
over time and changes in the structure of behavior over time. Multiple
measures of both exogenous and endogenous variables with known scale
properties are needed to gain a better grip on these problems. If these
problems can be alleviated, selection and growth modeling can become even
more widely useful.

Structural Equation Modeling 

At various points in the discussions of improvements in analyses of
non-equivalent control group designs, we encountered lingering concerns
about the nature of the model specification for both selection processes
and outcomes, fallible measurements, the handling of multiple indicators,
changing scales of measurement and changes in the structure of behavior
over time. Resolution of the first of these concerns is never complete;
one progresses through obtaining better understanding of the phenomena under
investigation (both their elements (constructs) and their interrelations).
"Better" theories are the only answer. The combination of improvements
in the accumulated wisdom on given phenomena (i.e., better thinking about
how a program works and about its possible consequences) and better
operationalization of the elements of one's theoretical model (i.e., more
comprehensive and valid measurement of its constructs) is a necessary foundation
for positive increments in the quality of investigations of social programs.
Analytical methods for handling the remaining concerns cited in the opening
paragraph of this section (namely fallible measurements, multiple indicators,
changing scales of measurement and structure of behavior over time) would
seem to be useful to ensure that better thinking and operationalization are
reflected in better data analysis and interpretation. Such analytical
advances would seem to be particularly pertinent to the broad conception of
large-scale program evaluation advocated here.

In theory, the techniques of structural equation modeling with latent
variables (see Bentler, 1980; Bentler and Woodward, 1979; Bilby and Hauser,
1979; Goldberger and Duncan, 1973; Joreskog, 1970, 1973, 1974, 1977; Joreskog
and Sorbom, 1976, 1978; Sorbom and Joreskog, 1981; Wiley, 1971) appear to
be particularly well-suited for resolving several of the remaining
methodological problems cited above. These techniques are designed to estimate the
unknown coefficients in specified "causal" structures among latent
(unobservable) variables. The references cited above provide extensive discussions
of the current state of work on structural equation modeling, including
indications of the kinds of substantive and methodological problems to which
these techniques are applicable. Most of the literature addresses mainstream
social research issues. However, there have been several applications in
educational research contexts (see Lomax (1981) for a partial bibliography
of educational research applications; however, one of the most comprehensive
and carefully documented applications of these methods to educational
questions (namely, Munck, 1979) and recent applications with hierarchical
data (Keesling, 1978; Wisenbaker, 1980; Wisenbaker and Schmidt, 1978) are
not cited).

Existing applications in large-scale educational evaluations are even
more limited. The best known is the exchange between Magidson (1977, 1978)
and Bentler and Woodward (1978, 1979) on the effects of Head Start. Abt and
Magidson (1980) also use structural equation modeling in their evaluation of
a specific school reform. Sorbom and Joreskog (1981) discuss how these
techniques can be applied in evaluation research. Finally, structural
equation modeling of latent variables is the primary analytical method in the
longitudinal examinations of the effects of the characteristics of the
educational process and students' background on academic achievement during
elementary school years [conducted as part of System Development Corporation's
(SDC) Sustaining Effects Study; see Wingard, 1980] and was one of the
analytical methods used in SDC's cross-sectional study of the effects of
instruction on the achievement growth of compensatory-education students
(Wang et al., 1981). Given the prominence (and cost) of the Sustaining Effects
Study among the set of recent large-scale evaluations in education, we are
likely to see additional attempts to apply these methods, assuming of course
the continuation of large-scale quantitatively oriented evaluations.

We will not attempt to recount in detail the various analytical nuances
of structural equation modeling with latent variables. Instead, the general
strategy employed by Joreskog and his associates in their LISREL (Linear
Structural Relations) modeling will be described. We then provide a partial
accounting of the specific analytical problems in program evaluations that
can be addressed, at least in part, by these methods. As with the analytical
developments considered earlier, we conclude with a discussion of what we
perceive to be the main limitations of structural equation modeling in
evaluation contexts.

Basic approach. In currently available variants of structural equation
modeling, one begins with a theoretical model about the structural (perhaps
causal) relations among a set of pertinent latent (unobservable) constructs
(e.g., student background and ability, program and instructional quality,
schooling context, student performance). One attempts to operationalize
these constructs through the collection of information on observable
indicators of each construct (say, measures of aptitudes and some quality at
time of program entry; measures of program and instructional characteristics
(e.g., emphasis, intensity); measures of environmental characteristics
(ability, composition, perceived climates); measures of cognitive, affective,
and social outcomes).

The information from these indicators has an observed covariance structure
(i.e., each variable yields observed estimates of variance as well as
exhibiting covariation with other observed variables). One then estimates
the relationships among latent variables and of latent variables to observed
variables via statistical means and attempts to reconstruct the observed
variance-covariance structure (matrix of variances and covariances) from the
estimated variances and covariances implied by the theoretical specification.
At this point one judges the acceptability of the fit of the estimated
structure to the observed structure and, depending on one's perspective
(there is considerable debate about what to do next), either stops or goes
through another iteration of the specification-estimation process if the
results are unsatisfactory.

LISREL. As we said earlier, the LISREL model developed by Joreskog and
associates (Joreskog, 1973, 1974, 1977; Joreskog and Sorbom, 1978) is the
most widely used analytical approach to estimation in structural equation
modeling. This method handles a set of linear structural relations. "The
variables in the equations system may be latent variables and there may be
multiple indicators or causes of each latent variable ... the method allows
for both errors in equations (residuals, disturbances) and errors in the
observed variables (errors of measurement, observational errors) ... yields
estimates of the residual covariance matrix and the measurement error
covariance matrix as well as estimates of the unknown coefficients in the
structural equations, provided that all these parameters are known"
(Joreskog, 1980, p. 106).

There are two submodels in the LISREL estimation of structural relations
among latent variables. There is a structural model, which specifies the
relationships among latent variables. In addition, there is a measurement
model, which specifies the relationships of the measured variables to the
unobserved constructs. Typically, there are multiple indicators of each
latent construct. The interrelationships among the observed indicators of
the same construct are then used to separate the presumed underlying true
constructs from the irrelevant and error components of each measure.

The analyst starts with a specification of the structural model and
the measurement model. If the unknown parameters in both parts of the model
are identified (which requires, at a minimum, at least as many observed
variances and covariances as parameters to estimate) and if the measured
variables have a multivariate normal distribution, maximum-likelihood
estimates for the parameters are provided along with accompanying standard
errors. There are also procedures for testing lack of fit for all or part of
the model (e.g., Bentler and Bonnett, 1981). More formally, the LISREL model
can be specified as follows. Let $\eta = (\eta_1, \eta_2, \ldots, \eta_m)$
and $\xi = (\xi_1, \xi_2, \ldots, \xi_n)$ be random vectors of latent
dependent (endogenous) variables and independent (exogenous) variables,
respectively. In a simple input-process-outcome model of program impact with
non-experimental data, the latent variables in $\xi$ might be socioeconomic
background ($\xi_1$), quality of the home ($\xi_2$), and student ability
($\xi_3$). The latent dependent variables would be program quality
($\eta_1$; program quality is treated as endogenous because it is viewed as
determined in part by the specific input characteristics of students) and
program outcomes such as cognitive ($\eta_2$) and social ($\eta_3$)
functioning. The system of linear structural relations is given by

    $B\eta = \Gamma\xi + \zeta$,                                        (18)

where $B$ and $\Gamma$ are coefficient matrices for the relations among
endogenous variables (e.g., between $\eta_1$ and $\eta_2$) and of the
exogenous variables to the endogenous variables (e.g., $\xi_1$ to $\eta_2$),
and $\zeta$ is a random vector of residuals (errors in equation, random
disturbance terms).
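
A short worked step may help here. Assuming $B$ is nonsingular (an
assumption added for this illustration), equation (18) can be solved for the
endogenous constructs, which makes explicit how each element of $\eta$
depends on the exogenous constructs and the disturbances:

```latex
\eta = B^{-1}\Gamma\xi + B^{-1}\zeta
```

If, for instance, $B$ were the identity matrix (no endogenous construct
affecting another), this would reduce to a set of ordinary regressions of
each $\eta$ on $\xi$.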

The vectors $\eta$ and $\xi$ are not observed. Instead we observe vectors
$Y = (Y_1, \ldots, Y_p)$ and $X = (X_1, \ldots, X_q)$, which are indicators
of the latent endogenous and exogenous variables, respectively. For example,
program quality ($\eta_1$) might be measured by the opportunity to learn
relevant curriculum ($Y_1$) and the quality of the presentation of the
material ($Y_2$). Cognitive functioning ($\eta_2$) might be measured by
reading ($Y_3$) and mathematics ($Y_4$) achievement tests, and social
functioning by sociometric measures of friendship networks ($Y_5$) and
teacher ratings of social functioning ($Y_6$). Observed indicators of the
latent exogenous variables might be family income ($X_1$) and mother's and
father's education ($X_2$ and $X_3$) for socioeconomic background ($\xi_1$);
availability of learning resources ($X_4$) and parental aspirations for
their child ($X_5$) for quality of the home ($\xi_2$); and pretests on
reading ($X_6$) and mathematical skills ($X_7$) for student ability
($\xi_3$). The system of equations expressing the measurement model can be
written as

    $Y = \Lambda_y \eta + \varepsilon$,
    $X = \Lambda_x \xi + \delta$,                                        (19)

where $\Lambda_y$ and $\Lambda_x$ are matrices of regression coefficients
relating $\eta$ to $Y$ and $\xi$ to $X$, respectively, and $\varepsilon$ and
$\delta$ are vectors of errors of measurement in $Y$ and $X$, respectively.
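
The counting condition for identification mentioned earlier can be checked
by hand for the hypothetical example. The sketch below, in Python, tallies
the distinct observed variances and covariances for the 13 indicators and
compares them with the number of free parameters under one illustrative
specification. The specification itself (each indicator loading on a single
construct, one loading per construct fixed for scaling, uncorrelated
measurement errors, and the particular pattern of structural paths) is an
assumption made for illustration, not one given in the text, and the names
`distinct_moments` and `free_parameters` are hypothetical.

```python
# Illustrative parameter count for the hypothetical 13-indicator example.
# Every modeling choice below is an assumption made for this sketch.

def distinct_moments(n_observed):
    """Number of distinct variances and covariances among n observed variables."""
    return n_observed * (n_observed + 1) // 2

# Indicators assigned to each latent construct, as in the hypothetical example.
endogenous = {"eta1": ["Y1", "Y2"], "eta2": ["Y3", "Y4"], "eta3": ["Y5", "Y6"]}
exogenous = {"xi1": ["X1", "X2", "X3"], "xi2": ["X4", "X5"], "xi3": ["X6", "X7"]}

def free_parameters(endogenous, exogenous, n_beta_paths):
    """Count free parameters under the assumed specification."""
    n_y = sum(len(v) for v in endogenous.values())
    n_x = sum(len(v) for v in exogenous.values())
    m, n = len(endogenous), len(exogenous)
    loadings = (n_y - m) + (n_x - n)   # one loading per construct fixed to 1
    error_variances = n_y + n_x        # diagonal measurement error covariances
    phi = n * (n + 1) // 2             # variances and covariances of the xi's
    gamma = m * n                      # every xi allowed to affect every eta
    psi = m                            # one disturbance variance per eta
    return loadings + error_variances + phi + gamma + n_beta_paths + psi

n_observed = sum(len(v) for v in list(endogenous.values()) + list(exogenous.values()))
moments = distinct_moments(n_observed)
# Assume two paths among endogenous constructs: eta1 -> eta2 and eta1 -> eta3.
n_free = free_parameters(endogenous, exogenous, n_beta_paths=2)
degrees_of_freedom = moments - n_free

print(n_observed, moments, n_free, degrees_of_freedom)
```

Under these assumptions there are 91 distinct variances and covariances and
40 free parameters, so the counting condition is met. Meeting it is necessary
but not sufficient for identification.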
(3) Measuring changes in the scaling of variables over time (e.g., Joreskog,
1979; Sorbom, 1979a).

(4) Detecting changes in the structure of behavior over time (Joreskog,
1979; Shavelson, Bolus and Keesling, 1981).

(5) Detecting differences in the structural relations across groups (e.g.,
Bentler and Woodward, 1978; Sorbom, 1979b, 1979c).

The first four applications reflect contributions targeted toward specific
concerns that arise in quasi-experimental and non-experimental evaluation
studies. The last application allows analysts to compare specific program
alternatives (e.g., participation in Title I vs. Follow Through, or High
Scope vs. Direct Instruction Follow Through Models) in a more sensitive,
comprehensive, and, we believe, sensible way.

Limitations. Unfortunately, as with most analytical advances, there
are important practical limitations in applying structural equation modeling
in general and LISREL specifically. The most serious and endemic problem
is that the adequacy of the methods is inherently dependent on the quality
of the model specification: both the limits of current theory (which
constructs are pertinent) and of current operationalization through the
measures one collects. Bad theory and bad data are no less bad simply because
we analyze them in a sophisticated and complicated fashion. It is unclear
whether the consequences of these shortcomings are more severe in structural
equation models, though the appearance of sophistication whenever
parsimonious and simple examinations are flawed would seem to be a dangerous
attribute of any analytical technique.

Another potentially serious limitation is the question of robustness
of LISREL to violation of multivariate normality assumptions. Current
versions of LISREL are not well-suited for such complications of discrete

If $\Sigma$ represents the population covariance matrix among the $p$ and $q$
measured variables (13 in our hypothetical example: 6 indicators of
endogenous variables and 7 of exogenous variables), the elements of this
matrix can be expressed as functions of the elements of the four matrices of
regression parameters ($\Lambda_y$, $\Lambda_x$, $B$, $\Gamma$), the
covariance matrix among the exogenous latent variables $\xi$ (typically
denoted by $\Phi$), and the covariance matrices of the errors in the
structural ($\Psi$) and measurement ($\Theta_\varepsilon$ and
$\Theta_\delta$) models. In application some of these elements are fixed
(assigned given values), others are constrained (unknown but equal to one or
more other parameters), and the remainder are free parameters to be estimated
by the procedures.
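
The functional dependence described above can be written out explicitly. The
expressions below are a sketch under assumptions not spelled out in the text:
$B$ is nonsingular; the disturbances are uncorrelated with the exogenous
constructs; and the measurement errors are uncorrelated with the latent
variables, with the disturbances, and across the two sets of indicators. The
symbols $\Psi$, $\Theta_\varepsilon$, and $\Theta_\delta$ follow the
conventional LISREL notation for the disturbance and measurement error
covariance matrices.

```latex
\Sigma =
\begin{pmatrix}
\Sigma_{yy} & \Sigma_{yx} \\
\Sigma_{xy} & \Sigma_{xx}
\end{pmatrix},
\qquad
\begin{aligned}
\Sigma_{yy} &= \Lambda_y B^{-1}\left(\Gamma\Phi\Gamma' + \Psi\right)
               \left(B^{-1}\right)'\Lambda_y' + \Theta_\varepsilon, \\
\Sigma_{yx} &= \Lambda_y B^{-1}\Gamma\Phi\Lambda_x' = \Sigma_{xy}', \\
\Sigma_{xx} &= \Lambda_x\Phi\Lambda_x' + \Theta_\delta .
\end{aligned}
```

Estimation then amounts to choosing the free elements of these matrices so
that the implied $\Sigma$ is as close as possible to the observed covariance
matrix.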

Areas of application in evaluation contexts. In most practical applications
of LISREL, one focuses on estimating the regression parameter matrices
($B$, $\Gamma$, $\Lambda_y$, and $\Lambda_x$). The ultimate intent is
obviously to represent the true structural relationships. The specific
analytical problems in program evaluation that LISREL can handle are those
that arise in many social research settings. LISREL may be used to deal with
a number of problems simultaneously (e.g., Magidson, 1977; Bentler and
Woodward, 1978) or may be restricted to handling a single problem (e.g.,
perhaps obtaining estimates of latent variables for use in selection
modeling, or estimating the factor structure among observable indicators).
Particular applications include:

(1) Correcting for the effects of measurement error (e.g., Keesling and
Wiley) in quasi-experiments.

(2) Taking both irrelevance (specific factors unrelated to the construct
of interest but present in measured variables) and measurement errors
into account (e.g., Linn and Werts, 1977).

measures of independent and dependent variables (except for the multiple
group comparison application). Muthen (1979) has worked out procedures
for handling certain structural models involving dichotomous variables
(e.g., factor analysis of dichotomous variables), but they are not nearly
as comprehensive as LISREL. Some researchers have turned to a related set
of methods, partial least-squares (PLS), developed by Wold (see McGarvey and
Bentler, 1980), because they do not require multivariate normality.
However, in the few empirical examples currently available, the estimates
from LISREL and PLS are not very different and the rationale for PLS remains
more obscure.

Despite some initial forays by Schmidt and others (Keesling, 1978; 
Schmidt, 1969; Wisenbaker, 1980; Wisenbaker and Schmidt, 1978), structural 
equation models for analyzing the hierarchical data frequently encountered 
in evaluations remain underdeveloped. It is simply too early to tell how to 
proceed in this area.

Finally, even though the primary reason many investigators turn to
LISREL is its ability to estimate complex models with multiple latent
constructs and multiple measurements, the practical reality is that LISREL
estimation is often overwhelmed by the sheer size and complexity of such
models. There are too many ways to go wrong. With large data sets and many
parameters, practically inconsequential differences in parameters cause
statistical fit indices to be significant (necessitating modification of
the model). Though LISREL is capable of simultaneously estimating measurement
and structural models, in practice researchers with a large number of
variables often have to estimate these models in separate stages. And the
analyses are very expensive by current standards for costs of alternative,
though simplified, analytical methods. In his analysis of the SES study
of longitudinal data, Wingard (personal communication) estimates that
his typical computer run involving roughly 8 latent constructs with 3 to
10 indicators each costs roughly $250 and often may not even converge to
within acceptable limits for the maximum-likelihood estimation.
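
The sensitivity of fit tests to sample size can be illustrated numerically.
The Python sketch below evaluates the maximum-likelihood discrepancy
function, F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p, for a two-variable case
in which the hypothesized correlation (.28) differs trivially from the
observed one (.30), and forms the test statistic as (N - 1)F, one common
convention. The correlations, the sample sizes, the fully fixed hypothesized
matrix, and the function names are all assumptions chosen for illustration.

```python
import math

def ml_discrepancy_2x2(s, sigma):
    """ML discrepancy F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p for p = 2."""
    (s11, s12), (_, s22) = s
    (a, b), (_, d) = sigma
    det_s = s11 * s22 - s12 * s12
    det_sigma = a * d - b * b
    # For a symmetric 2x2 Sigma, Sigma^-1 = [[d, -b], [-b, a]] / det_sigma.
    trace_term = (s11 * d - 2.0 * s12 * b + s22 * a) / det_sigma
    return math.log(det_sigma) + trace_term - math.log(det_s) - 2.0

def lr_statistic(n, discrepancy):
    """Likelihood-ratio statistic under the (N - 1) * F convention."""
    return (n - 1) * discrepancy

observed = [[1.0, 0.30], [0.30, 1.0]]       # hypothetical sample matrix S
hypothesized = [[1.0, 0.28], [0.28, 1.0]]   # hypothetical fully fixed Sigma

# With every element of Sigma fixed, df = p(p + 1)/2 = 3; the .05 critical
# value of chi-square with 3 degrees of freedom is about 7.815.
CRITICAL_VALUE = 7.815

f_value = ml_discrepancy_2x2(observed, hypothesized)
for n in (200, 20000):
    t = lr_statistic(n, f_value)
    print(n, round(t, 3), "reject" if t > CRITICAL_VALUE else "retain")
```

The same discrepancy of .02 in a correlation is retained with 200 cases and
rejected with 20,000, because the statistic grows in proportion to sample
size while the discrepancy itself is unchanged.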

So, again, we find ourselves with an obvious improvement in analytical 
methods that is applicable in large-scale program evaluation but is flawed 
in important respects. Clearly, structural equation modeling is a tool
worth having but also one that must be used cautiously.

Concluding Remarks

In our examination of two general classes of analytical methods we 
have attempted to highlight why they might be considered, how they can be 
applied, and the limitations on their application. We could have taken
each major area of analytical improvements in the past few years and treated
them similarly (see, for example, the excellent review of Traub and Wolfe
(in press) of the promise and problems in latent trait models for educational
measurement).

But this is as it should be. Empirical investigations, be they ran- 
domized experiments or simply "passive observational studies", have their 
imperfections and special shortcomings. Thus, it is not surprising that 
there is no handy-dandy analytical method that solves all problems. The 
design and analysis perspective advocated here and presumably shared by 
Cook (1974, 1981) and Cronbach et al. (1980) (see also Burstein (1981))
does not require that any one method be without flaws. Instead, it is
the weight of the evidence from multiple analyses (and reanalyses) on
perhaps overlapping but separable questions and sets of data that should
guide interpretation.

One last caveat. After beginning our work on analytical advances, we
quickly became convinced that there were more fundamental problems in the
area of data collection in program evaluations that greatly limit the payoff
from analytical developments. In fact, we view data collection as the
"Achilles Heel" of program evaluation, especially in the way it vitiates
the validity of data analysis and interpretation. Elsewhere we (Burstein,
Freeman and Sirotnik, 1981) have outlined our reasons for concerns about
data collection. At some point, methodologists working in the area of program
evaluation will devote greater attention to data collection problems.
If not, the next generation of evaluation studies are destined to suffer
the fate of the last generation's despite their enhanced analytical power.

Footnotes

1. We simply do not subscribe to the conspiratorial view of the shift in
emphasis (essentially, if you can't find significant effects, change the
question) as characterized in several recent accounts of the political
history of the evaluation of social programs. Certainly, social programs
develop a political constituency (often labeled stakeholders), consisting of
legislators, bureaucrats, service providers, program participants, members
of the public, as well as evaluators, that has a stake in maintaining program
activities. These programs also develop enemies (political and ideological)
and suffer through internal bickering and lack of common perspective.

Yet the interplay of competing forces surrounding any societal activity
that has political, economic, and social consequences is the norm rather
than the exception. Moreover, this interplay introduces its own set of
dynamics that affect the activity in complex and often unknown ways. Over
time a more refined articulation of activities (expected and actual) and
their consequences (expected and actual) evolves. It is only natural, then,
that the search for better understanding also shifts to more sophisticated
and sensitive methods for explicitly linking activities with their
consequences.

2. This part of the presentation draws heavily from Barnow et al. (1980).

3. Tuma and Hannan's work (Tuma and Hannan, 1978; Tuma, Hannan, and
Groeneveld, 1978) grounds the analysis of changes over time on a categorical
dependent variable in a continuous-time stochastic model. They start with a
continuous-time Markov model, extend it to deal with population heterogeneity
(e.g., differences in background and program characteristics) and time
dependence, and develop a maximum-likelihood estimation procedure for
estimating the model from what they call "event-histories" (data giving the
number, timing and sequence of changes for a categorical dependent variable).
These methods seem to be responsive to certain concerns addressed in the Bryk
and Weisberg value-added analysis (i.e., dynamic models of change processes)
as well as the econometric selection modeling (dealing with various selection
problems such as attrition and systematic selection). However, the techniques
are currently restricted to discrete outcome variables (e.g., decision to
attend college or not; or college dropout decision), while the present review
is restricted to evaluation studies in which the outcomes are viewed as
essentially continuous dimensions.

4. We have chosen to use the term "structural equation" modeling rather than
the label "causal" modeling more widely used in educational and psychological
applications. In our view, the latter term attracts too much criticism
about whether phenomena are truly "causal" as opposed to simply relational.
This criticism detracts from the analytical potential inherent in these
statistical aspects of the models. No one denies that practice is less than
ideal (i.e., we never really know the causes in non-experimental studies, or
experimental ones for that matter) and that this misspecification is an
inherent property of empirical social research. Misspecification, in turn,
inevitably leads to flawed estimation. Nonetheless, one can conceive of
a continuum of better vs. worse empirical approximations to reality. We
contend that structural equation modeling with latent variables can
potentially yield results that approach the "better" pole of the continuum
and thus should not be excluded because they are flawed (some philosophers
might judge them "wrong").
